Walkthrough: PII Redaction
The PII Redaction SDK lets you use state-of-the-art models to remove personally identifiable information (PII) from text.
!pip install datasets
# This installs the version of spaCy compatible with CUDA 12.x for GPU use.
# To successfully use GPU acceleration, ensure that your system has a compatible CUDA version installed.
# Specific steps for setting up GPU acceleration can be found here: https://spacy.io/usage
!pip install spacy'[cuda12x]'
Import the find_pii function from the dynamofl.privacy library.
import json
from datasets import load_dataset
from dynamofl.privacy import find_pii
1. Redact PII using the find_pii() method
The SDK simplifies Personally Identifiable Information (PII) redaction with an easy-to-use function called find_pii().
The find_pii() method returns a dictionary with the following keys:
- If the text is of type str:
  - redacted_text (str): The redacted string with PII removed.
  - redacted_entities (dict): A dictionary where keys are entity types (e.g., 'ORG', 'DATE') and values map each placeholder tag (e.g., '[ORG]') to the list of redacted entities found.
  - redacted_entity_positions (list of tuples): Positions of redacted entities.
- If the text is of type List[str] or Dataset (HuggingFace dataset):
  - redacted_dataset (List[str] or Dataset): The list or dataset of strings where PII has been redacted.
  - redacted_entities (dict): A dictionary where keys are entity types (e.g., 'ORG', 'DATE') and values map each placeholder tag (e.g., '[ORG]') to the list of redacted entities found.
text_with_pii= """DynamoFL which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""
# Redact PII with the transformers provider
pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    entity_types=["PERSON", "ORG", "DATE", "MONEY"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))
[ORG] which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, [DATE] announced that it raised <MONEY> in a Series A funding round led by [ORG] and [ORG].
{ "ORG": { "[ORG]": [ "Nexus Venture Partners", "Canapi Ventures", "DynamoFL" ] }, "MONEY": { "[MONEY]": [ "$15.1 million" ] }, "DATE": { "[DATE]": [ "today" ] } }
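The nested shape of redacted_entities printed above (entity type, then placeholder tag, then the matched strings) can be awkward to scan. The following is a minimal sketch, assuming the dictionary has exactly the shape shown in that output; flatten_entities is a hypothetical helper and not part of the SDK.

```python
# Sample copied from the output above; in practice this would be
# pii_results["redacted_entities"].
redacted_entities = {
    "ORG": {"[ORG]": ["Nexus Venture Partners", "Canapi Ventures", "DynamoFL"]},
    "MONEY": {"[MONEY]": ["$15.1 million"]},
    "DATE": {"[DATE]": ["today"]},
}


def flatten_entities(entities):
    """Turn {type: {tag: [values]}} into sorted (type, tag, value) rows."""
    rows = [
        (entity_type, tag, value)
        for entity_type, tags in entities.items()
        for tag, values in tags.items()
        for value in values
    ]
    return sorted(rows)


rows = flatten_entities(redacted_entities)
for entity_type, tag, value in rows:
    print(f"{entity_type:<6} {tag:<8} {value}")
```

A flat list like this is easier to write to a CSV file or to review by hand than the nested dictionary.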
2. Use any public or custom model
With the SDK, you have the flexibility to use any public or custom transformers, spaCy, flair or presidio model for PII redaction. This capability is made possible through the model_config parameter in the find_pii() function. Currently, the PII SDK supports English token classification models. To configure your model, use the following format: {"lang_code": "en", "model": "<your model ID or path>"}
It is worth noting that open-source NER providers do not provide support for a wide range of models. The SDK ensures a consistent experience across all providers (transformers, spacy, flair and presidio), allowing you to incorporate various models into your PII redaction workflow. Please refer to section (3) on how to redact custom entities using your custom models.
Each provider supports the following models:
- transformers: Any public token classification model listed here, or any custom model.
- spaCy: Any public model listed here, or any custom model.
- flair: Any public model listed here, or any custom model.
- presidio: Any public model listed here, or any custom model.
text_with_pii = """DynamoFL, which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""
pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    entity_types=["ORG"],
    model_config={"lang_code": "en", "model": "dslim/bert-base-NER"},
)
print(pii_results["redacted_text"])
print()
print(json.dumps(pii_results["redacted_entities"], indent=2))
[ORG], which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, today announced that it raised $15.1 million in a Series A funding round led by [ORG] and [ORG].
{ "ORG": { "[ORG]": [ "Nexus Venture Partners", "Canapi Ventures", "DynamoFL" ] } }
3. Redact strings, HuggingFace datasets and custom datasets
The SDK simplifies the process of redacting personally identifiable information (PII) across various data structures. Whether you are dealing with individual strings, HuggingFace datasets, or custom datasets, the find_pii() function provides a straightforward and unified approach to safeguard sensitive information. The examples shown below use the transformers provider, but spacy, presidio or flair may also be used.
3.1 Redact individual strings
text_with_pii= """DynamoFL which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""
pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    entity_types=["PERSON", "ORG", "DATE", "MONEY"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))
[ORG] which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, [DATE] announced that it raised [MONEY] in a Series A funding round led by [ORG] and [ORG].
{ "ORG": { "[ORG]": [ "Nexus Venture Partners", "Canapi Ventures", "DynamoFL" ] }, "MONEY": { "[MONEY]": [ "$15.1 million" ] }, "DATE": { "[DATE]": [ "today" ] } }
3.2 Redact a HuggingFace dataset
dataset = load_dataset("tweet_eval", "stance_climate")
pii_results = find_pii(
    model_type="transformers",
    text=dataset,
    dataset_config={
        "text_column": "text",
        "train_name": "train",
    },
    entity_types=[
        "CARDINAL", "DATE", "EVENT", "FAC", "GPE",
        "LANGUAGE", "LAW", "LOC", "MONEY", "NORP",
        "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
        "QUANTITY", "TIME", "WORK_OF_ART",
    ],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
# Display the keys as the entity dictionary is large
print(pii_results["redacted_entities"].keys())
dict_keys(['PERSON', 'GPE', 'ORG', 'TIME', 'EVENT', 'CARDINAL', 'QUANTITY', 'LOC', 'NORP', 'DATE', 'ORDINAL', 'WORK_OF_ART', 'FAC', 'PRODUCT', 'MONEY', 'LAW', 'PERCENT'])
3.3 Redact a custom dataset
The PII SDK accepts custom datasets as a list of strings, which keeps the interface general across data sources.
custom_dataset = [
"""
DynamoFL, which offers software to bring large language models (LLMs) to
enterprises and fine-tune those models on sensitive data, today announced
that it raised $15.1 million in a Series A funding round co-led by
Canapi Ventures and Nexus Venture Partners.
""",
"""
The tranche, with had participation from Formus Capital and Soma Capital,
brings DynamoFL’s total raised to $19.3 million. Co-founder and CEO Vaikkunth Mugunthan
says that the proceeds will be put toward expanding DynamoFL’s product
offerings and growing its team of privacy researchers.
""",
"""
Taken together, DynamoFL’s product offering allows enterprises to develop private
and compliant LLM solutions without compromising on performance,” Mugunthan told
TechCrunch in an email interview.
""",
]
pii_results = find_pii(
    model_type="transformers",
    text=custom_dataset,
    entity_types=["PERSON", "ORG", "DATE", "MONEY"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(json.dumps(pii_results["redacted_entities"], indent=2))
{ "ORG": { "[ORG]": [ "Nexus Venture Partners", "Canapi Ventures", "DynamoFL", "Soma Capital", "Formus Capital", "TechCrunch" ] }, "MONEY": { "[MONEY]": [ "19.3" ] }, "DATE": { "[DATE]": [ "today" ] }, "PERSON": { "[PERSON]": [ "Vaikkunth Mugunthan", "Mugunthan" ] } }
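Because a custom dataset is just a list of strings, records from other sources need to be reduced to their text field first. The following is a minimal sketch using only the standard library; the CSV content and the column name body are made up for illustration.

```python
import csv
import io

# Hypothetical CSV export; in practice this would be a file on disk.
raw_csv = io.StringIO(
    "id,body\n"
    "1,DynamoFL raised $15.1 million in a Series A round.\n"
    '2,"Co-founder and CEO Vaikkunth Mugunthan leads the team."\n'
    "3,\n"
)

# Keep only non-empty text fields; the result is a plain list of strings.
custom_dataset = [
    row["body"].strip()
    for row in csv.DictReader(raw_csv)
    if row["body"] and row["body"].strip()
]

print(custom_dataset)
```

The resulting list can then be passed as the text argument, in the same way as the hand-written list above.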
4. Supported entity types
The PII Redaction SDK supports the following pre-defined entity types for English models:
- transformers: Entity types to be redacted must be specified using the entity_types parameter in the find_pii() function. dynamofl-sandbox/pii-roberta-large supports the following 18 entity classes: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
- spaCy: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
- flair: Flair models offer two pre-defined entity configurations:
  - 4 class: PER, ORG, LOC, MISC
  - 18 class: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
- presidio (uses spaCy models under the hood): CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
NOTE 1: Optionally, a subset can be specified using the entity_types parameter in the find_pii() function. This parameter must be specified when using the transformers provider, as noted in the list above.
NOTE 2: DynamoFL recommends using dynamofl-sandbox/pii-roberta-large for redacting sensitive information.
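Since the requested entity types have to match the labels a model supports, a small pre-flight check can catch typos before a long run. The following is a minimal sketch, assuming the 18 labels listed above; check_entity_types is a hypothetical helper, and the SDK may perform its own validation.

```python
SUPPORTED_18 = {
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
}


def check_entity_types(requested, custom=()):
    """Return the requested types that are neither supported nor custom."""
    known = SUPPORTED_18 | set(custom)
    return sorted(set(requested) - known)


unknown = check_entity_types(["PERSON", "ORG", "DATES"])
print(unknown)  # ['DATES']
```

The custom argument leaves room for user-defined entity types, which are not in the pre-defined label set.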
text_with_pii = """DynamoFL, which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""
pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    entity_types=["DATE"],  # Redact only the date entity; use a subset of the entity types supported by the model
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results["redacted_text"])
print()
print(pii_results["redacted_entities"])
DynamoFL, which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, <DATE> announced that it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners.
{'DATE': {'[DATE]': ['today']}}
5. Regex and exact match support
The PII Redaction SDK also supports regex entity types and exact match entity types with Transformers and Presidio. To do so, use the custom_entity_config parameter. Set it to a configuration dictionary, including all the regex and exact match entity types to tag. This support will be extended to spaCy and flair soon!
The custom_entity_config is a nested dictionary. Each top-level key is an entity_type, the name of the custom entity, and its value is a dictionary with the following keys:
- recognizer_type: The type of recognizer, either 'regex' or 'deny-list'.
- deny_list: A list of patterns to be added to the deny-list, used when the 'deny-list' recognizer is chosen.
- regex: A regular expression pattern, used when the 'regex' recognizer is chosen.
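A configuration of this shape can be sanity-checked before it is handed to find_pii(). The following is a minimal sketch based only on the keys described above; validate_custom_entity_config is a hypothetical helper, and the SDK may accept further keys or apply its own checks.

```python
import re

VALID_RECOGNIZERS = {"regex", "deny-list"}


def validate_custom_entity_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for entity_type, spec in config.items():
        kind = spec.get("recognizer_type")
        if kind not in VALID_RECOGNIZERS:
            problems.append(f"{entity_type}: unknown recognizer_type {kind!r}")
        elif kind == "deny-list":
            if not spec.get("deny_list"):
                problems.append(f"{entity_type}: deny_list is missing or empty")
        else:  # kind == "regex"
            pattern = spec.get("regex")
            if not pattern:
                problems.append(f"{entity_type}: regex is missing")
            else:
                try:
                    re.compile(pattern)
                except re.error as exc:
                    problems.append(f"{entity_type}: invalid regex ({exc})")
    return problems


config = {
    "CLIMATE_PII": {"recognizer_type": "deny-list", "deny_list": ["#SemST"]},
    "MENTION": {"recognizer_type": "regex", "regex": r"(@\w+\s*)+"},
    "BROKEN": {"recognizer_type": "regex", "regex": "(unclosed"},
}
problems = validate_custom_entity_config(config)
print(problems)
```

Only the deliberately malformed BROKEN entry is reported; the other two entries pass.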
dataset = load_dataset("tweet_eval", "stance_climate")
# Define parameters for custom entity types
custom_entity_config = {
    "CLIMATE_PII": {
        "recognizer_type": "deny-list",
        "deny_list": ["#SemST", "#environment", "#COP21"],
    },
    "MENTION": {
        "recognizer_type": "regex",
        "regex": r"(@\w+\s*)+",
        "score": 1.0,
    },
}
pii_results = find_pii(
    model_type="transformers",
    text=dataset,
    dataset_config={
        "text_column": "text",
        "train_name": "train",
    },
    # WORK_OF_ART is a pre-defined entity type; MENTION and CLIMATE_PII are the custom types defined above
    entity_types=["MENTION", "CLIMATE_PII", "WORK_OF_ART"],
    custom_entity_config=custom_entity_config,
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(json.dumps(pii_results["redacted_entities"], indent=2))
{ "CLIMATE_PII": { "[CLIMATE_PII]": [ "#SemST", "#environment", "#COP21" ] }, "MENTION": { "[MENTION]": [ "@user ", "@user @user ", "@user @user @user @user @user ", "@user @user @user @user ", "@TonyAbbottMHR ", "@msimire ", "@user @user @user ", "@user @user @user @user ", "@solarimpulse ", "@CreeClayton ", "@RobSilver ", "@potus ", "@ClimatParis2015", "@quinn43 ", "@MexONU ", "@user @user ", "@user ", "@BlissTabitha ", "@user @user " ] }, "WORK_OF_ART": { "[WORK_OF_ART]": [ "#ChasingIce", ""The Whale and the Supercomputer. On the Northern Front of Climate Change", "The Biggest Story In World Podcast", "TheBachelorette", "CaptainPlanet", "bible", "futurama", ""Crimes of the Hot"", "Futurama" ] } }
6. PII Post-processing
The PII SDK includes a post-processing feature that can be integrated into your workflow after personally identifiable information (PII) has been detected. It takes a user-defined function that dynamically chooses the replacement text for each identified entity, and it can be applied to both pre-defined and custom entities. The example shown below uses the transformers provider to dynamically redact a regex entity type.
import re
text_with_age = "Jack is 9. John is 26. Jill is 18."
# Create a callback for processing age mentions
def age_custom_redacted_text(match):
    # Map the matched age to a coarse bucket tag
    age = int(match)
    redacted_tag = ""
    if age < 10:
        redacted_tag = "[<10]"
    elif 10 <= age <= 20:
        redacted_tag = "[10-20]"
    else:
        redacted_tag = "[>20]"
    return redacted_tag

custom_entity_config = {
    "AGE": {
        "recognizer_type": "regex",
        "regex": r"\b\d{1,2}\b",
        "score": 1.0,
        "redacted_text_callback": age_custom_redacted_text,
    },
}
pii_results = find_pii(
    model_type="transformers",
    text=text_with_age,
    entity_types=["AGE", "PERSON"],
    custom_entity_config=custom_entity_config,
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))
[PERSON] is [<10]. [PERSON] is [>20]. [PERSON] is [10-20].
{ "AGE": { "[10-20]": [ "18" ], "[>20]": [ "26" ], "[<10]": [ "9" ] }, "PERSON": { "[PERSON]": [ "Jill", "John", "Jack" ] } }
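The effect of a redaction callback can be reproduced with the standard library alone. The following is a minimal sketch, assuming the callback receives the matched substring as a string (as the int(match) call in the example above suggests); it illustrates the idea with re.sub and is not the SDK's implementation. The name age_bucket is chosen here for illustration.

```python
import re


def age_bucket(match_text):
    """Map an age written as digits to a coarse bucket tag."""
    age = int(match_text)
    if age < 10:
        return "[<10]"
    if age <= 20:
        return "[10-20]"
    return "[>20]"


text_with_age = "Jack is 9. John is 26. Jill is 18."

# re.sub passes a Match object to the function; hand the callback its text.
redacted = re.sub(r"\b\d{1,2}\b", lambda m: age_bucket(m.group()), text_with_age)
print(redacted)  # Jack is [<10]. John is [>20]. Jill is [10-20].
```

Here only the regex entity is replaced; in the SDK example above, the PERSON entities are redacted by the model in the same call.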
7. Unique anonymization
The SDK's "unique anonymization" feature assigns consistent identifiers to the entities found within a document. Identical entities across different parts of the text share the same unique identifier.
text_with_repeated_entities = "John Doe lives in New York. James Doe recently moved to the same city."
'''
result_flair = find_pii(model_type="flair", text=text_with_repeated_entities, unique_anonymization=True)
result_spacy = find_pii(model_type="spacy", text=text_with_repeated_entities, unique_anonymization=True)
result_presidio = find_pii(model_type="presidio", text=text_with_repeated_entities, unique_anonymization=True)
'''
pii_results = find_pii(
    model_type="transformers",
    text=text_with_repeated_entities,
    entity_types=["PERSON", "GPE"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
    unique_anonymization=True,
)
print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))
[PERSON_1] lives in [GPE_1]. [PERSON_2] recently moved to the same city.
{ "PERSON": { "[PERSON_2]": [ "James Doe" ], "[PERSON_1]": [ "John Doe" ] }, "GPE": { "[GPE_1]": [ "New York" ] } }
8. Combinations of unique and non-unique anonymization
The PII SDK allows unique and non-unique anonymization strategies to be combined. This is particularly useful when a dataset contains several types of personally identifiable information (PII) that need different treatment.
Unique anonymization: Each distinct PII entity is replaced with its own numbered identifier (for example, [PERSON_1] and [PERSON_2]).
Non-unique anonymization: Every PII entity of a given type is replaced with the same identifier (for example, [ORG]).
Combinations of anonymization strategies: The PII SDK enables the combination of unique and non-unique anonymization strategies within a single redaction process. This is achieved by passing a list of entity types to the unique_anonymization parameter; only the listed types receive unique identifiers.
text_with_pii = "Google is a multinational technology company, its CEO is Sundar Pichai. Apple is a global technology company, its CEO is Tim Cook."
pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    entity_types=["ORG", "PERSON"],
    unique_anonymization=["PERSON"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(json.dumps(pii_results['redacted_entities'], indent=2))
[ORG] is a multinational technology company, its CEO is [PERSON_1]. [ORG] is a global technology company, its CEO is [PERSON_2].
{ "PERSON": { "[PERSON_2]": [ "Tim Cook" ], "[PERSON_1]": [ "Sundar Pichai" ] }, "ORG": { "[ORG]": [ "Apple", "Google" ] } }
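The tagging rule behind both modes can be sketched in a few lines. This is a minimal sketch, assuming unique_anonymization is either a boolean or a list of entity types (the two forms used in the examples above) and that identical strings of the same type share one numbered tag; tag_entities is a hypothetical helper, not the SDK's implementation.

```python
def tag_entities(entities, unique_anonymization=False):
    """entities: (surface, label) pairs in document order -> list of tags."""
    counters = {}  # label -> number of distinct surfaces seen so far
    assigned = {}  # (label, surface) -> numbered tag
    tags = []
    for surface, label in entities:
        if isinstance(unique_anonymization, bool):
            unique = unique_anonymization
        else:
            unique = label in unique_anonymization
        if not unique:
            tags.append(f"[{label}]")  # shared tag for the whole type
            continue
        key = (label, surface)
        if key not in assigned:
            counters[label] = counters.get(label, 0) + 1
            assigned[key] = f"[{label}_{counters[label]}]"
        tags.append(assigned[key])
    return tags


found = [
    ("Google", "ORG"),
    ("Sundar Pichai", "PERSON"),
    ("Apple", "ORG"),
    ("Tim Cook", "PERSON"),
    ("Sundar Pichai", "PERSON"),  # repeated entity reuses its tag
]
mixed = tag_entities(found, unique_anonymization=["PERSON"])
print(mixed)
# ['[ORG]', '[PERSON_1]', '[ORG]', '[PERSON_2]', '[PERSON_1]']
```

Keying the lookup on both the label and the surface string keeps the numbering independent per entity type.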
9. Identify but prevent redaction of certain types of PII
The PII SDK allows users to identify specific types of personally identifiable information (PII) while excluding them from redaction. This is done through the no_redact parameter, which is useful when certain PII elements need to be retained in their original form. The example shown below uses the transformers provider.
NOTE: The no_redact parameter can be combined with the unique_anonymization parameter or can be applied to custom entity types, enabling complex PII redaction.
text_with_pii = """DynamoFL which offers software to bring large language models (LLMs)
to enterprises and fine-tune those models on sensitive data, today announced that
it raised $15.1 million in a Series A funding round led by Canapi Ventures and Nexus Venture Partners."""
pii_results = find_pii(
    model_type="transformers",
    text=text_with_pii,
    no_redact=["MONEY"],  # MONEY is still detected and reported, but left in the text
    entity_types=["PERSON", "ORG", "DATE", "MONEY"],
    model_config={"lang_code": "en", "model": "dynamofl-sandbox/pii-roberta-large"},
)
print(pii_results['redacted_text'])
print()
print(pii_results['redacted_entities'])
[ORG] which offers software to bring large language models (LLMs) to enterprises and fine-tune those models on sensitive data, [DATE] announced that it raised $15.1 million in a Series A funding round led by [ORG] and [ORG].
{'ORG': {'[ORG]': ['Nexus Venture Partners', 'Canapi Ventures', 'DynamoFL']}, 'MONEY': {'15.1 million']}, 'DATE': {'[DATE]': ['today']}}
10. Enhanced redaction speed
The SDK speeds up redaction by using an algorithm with a time complexity of O(n log L), where:
- 'n': Represents the number of personally identifiable information (PII) instances.
- 'L': Denotes the average length of PII.
Neither spaCy nor flair offers native support for redaction.
spaCy can only determine the positions of PII in the original text; it cannot redact the text.
From the spaCy documentation:
The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.
From the flair documentation:
Entities in this case are Span objects that have a number of fields you can access, such as .text. You can also iterate through all tokens of a span and access their text, idx and other fields:
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')
sentence = Sentence('George Washington went to Washington .')
tagger.predict(sentence)
for entity in sentence.get_spans('ner'):
    # print entity
    print(entity)
    # print only the entity text
    print(entity.text)
    # go through each token in entity and print its idx
    for token in entity:
        print(token.idx)
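Because both libraries stop at returning spans, the caller has to splice the placeholders into the text. The following is a minimal sketch of that step in a single left-to-right pass, assuming non-overlapping (start, end, label) character offsets such as those exposed by spaCy's ent.start_char, ent.end_char and ent.label_. The helper redact_spans is hypothetical and is not the SDK's O(n log L) algorithm.

```python
def redact_spans(text, spans):
    """Replace each (start, end, label) span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])  # tail after the last span
    return "".join(pieces)


text = "George Washington went to Washington ."
spans = [(26, 36, "LOC"), (0, 17, "PER")]  # hand-written offsets for the demo
print(redact_spans(text, spans))  # [PER] went to [LOC] .
```

Collecting the pieces in a list and joining once avoids re-copying the string for every replacement, which matters on long documents with many entities.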